pacman::p_load(olsrr, ggpubr, sf, spdep, GWmodel, tmap, tidyverse, gtsummary, SpatialML,rsample,Metrics, jsonlite,httr,rvest,sp)Take-Home Exercise 3: Predicting HDB Public Housing Resale Pricies using Geographically Weighted Methods
1. Overview
Housing is an essential component of household wealth worldwide. Buying a housing has always been a major investment for most people. The price of housing is affected by many factors. Some of them are global in nature such as the general economy of a country or inflation rate. Others can be more specific to the properties themselves. These factors can be further divided to structural and locational factors. Structural factors are variables related to the property themselves such as the size, fitting, and tenure of the property. Locational factors are variables related to the neighbourhood of the properties such as proximity to childcare centre, public transport service and shopping centre.

Conventional, housing resale prices predictive models were built by using Ordinary Least Square (OLS) method. However, this method failed to take into consideration that spatial autocorrelation and spatial heterogeneity exist in geographic data sets such as housing transactions. With the existence of spatial autocorrelation, the OLS estimation of predictive housing resale pricing models could lead to biased, inconsistent, or inefficient results (Anselin 1998). In view of this limitation, Geographical Weighted Models were introduced for calibrating predictive model for housing resale prices.
1.1 The Task
In this take-home exercise, we are tasked to predict HDB resale prices at the sub-market level (i.e. HDB 3-room, HDB 4-room and HDB 5-room) for the month of January and February 2023 in Singapore. The predictive models must be built by using by using conventional OLS method and GWR methods. We are also required to compare the performance of the conventional OLS method versus the geographical weighted methods.
2. Data Acquisition Source
For the purpose of this take-home exercise, HDB Resale Flat Prices provided by Data.gov.sg should be used as the core data set. The study should focus on either three-room, four-room or five-room flat and transaction period should be from 1st January 2021 to 31st December 2022. The test data should be January and February 2023 resale prices.
Data summary table
| Type | Name | Format | Source |
|---|---|---|---|
| Aspatial | HDB Resale Flat Prices | .csv | data.gov.sg |
| Geospatial | Master Plan 2014 Subzone Web Boundary | .shp | data.gov.sg |
| Geospatial (locational factor) | Eldercare Service | .shp | data.gov.sg |
| Geospatial (locational factor) | Park & Open Space | .shp | data.gov.sg |
| Geosaptial (locational factor) | Supermarket | .csv | data.gov.sg |
| Geospatial (locational factor) | Hawker Centre | .geoson | data.gov.sg |
| Geospatial (locational factor) | Bus Stop Location | .shp | LTA Data Mall |
**Find more geospatial (locational factors) data
3. Getting Started
3.1 Installing and Loading the R packages
The R packages installed that we will be using for analysis are:
**(to be completed)
4. Data Wrangling: Geospatial Data & Aspatial Data
4.1 Importing Aspatial Data
Import Data
resale <- read_csv("data/aspatial/resale-flat-prices-based-on-registration-date-from-jan-2017-onwards.csv")Display Data
glimpse(resale)Rows: 148,000
Columns: 11
$ month <chr> "2017-01", "2017-01", "2017-01", "2017-01", "2017-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
$ block <chr> "406", "108", "602", "465", "601", "150", "447", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 4", "ANG MO K…
$ storey_range <chr> "10 TO 12", "01 TO 03", "01 TO 03", "04 TO 06", "0…
$ floor_area_sqm <dbl> 44, 67, 67, 68, 67, 68, 68, 67, 68, 67, 68, 67, 67…
$ flat_model <chr> "Improved", "New Generation", "New Generation", "N…
$ lease_commence_date <dbl> 1979, 1978, 1980, 1980, 1980, 1981, 1979, 1976, 19…
$ remaining_lease <chr> "61 years 04 months", "60 years 07 months", "62 ye…
$ resale_price <dbl> 232000, 250000, 262000, 265000, 265000, 275000, 28…
From the results above, we can tell that:
The dataset contains 11 columns with 148,000 rows in total.
The timeframe of the dataset is from 2017 January to 2023 February up to date.
The columns that are present in the data are: month, town, flat_type, block, street_name, storey_range, floor_area_sqm, flat_model, lease_commence_date, remaining_lease, resale_price.
In this take-home assignment, I selected HDB 4-room flat resale prices to analyse during the transaction period from 1st January 2021 to 31st December 2022. Therefore, we will need to filter and only extract data during this period of time frame.
4.1.1 Filter Resale Data (ONLY interested transaction period)
Filter Data
rs_subset <- filter(resale,flat_type == "4 ROOM") %>%
filter(month >= "2021-01" & month <= "2022-12")Display Date
glimpse(rs_subset)Rows: 23,657
Columns: 11
$ month <chr> "2021-01", "2021-01", "2021-01", "2021-01", "2021-…
$ town <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
$ flat_type <chr> "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", "4 ROOM", …
$ block <chr> "547", "414", "509", "467", "571", "134", "204", "…
$ street_name <chr> "ANG MO KIO AVE 10", "ANG MO KIO AVE 10", "ANG MO …
$ storey_range <chr> "04 TO 06", "01 TO 03", "01 TO 03", "07 TO 09", "0…
$ floor_area_sqm <dbl> 92, 92, 91, 92, 92, 98, 92, 92, 92, 92, 92, 109, 9…
$ flat_model <chr> "New Generation", "New Generation", "New Generatio…
$ lease_commence_date <dbl> 1981, 1979, 1980, 1979, 1979, 1978, 1977, 1978, 19…
$ remaining_lease <chr> "59 years", "57 years 09 months", "58 years 06 mon…
$ resale_price <dbl> 370000, 375000, 380000, 385000, 410000, 410000, 41…
From the results above, we can tell that:
We have successfully filtered our data based on earlier chosen HDB model flat and transaction period!
From January 2021 to December 2022, there are 23,657 transactions for 4-room flat in Singapore.
4.1.2 Transform Resale Data
After we have extracted the rows of transactions we are interested in, we will then proceed to use mutate function of dplyr package to create new variables (columns) in a data frame by applying some transformations to the existing columns.
What we will need to do is:
address: concatenation of the block and street_name columns using paste() function of base R package.
remaining_lease_yr & remaining_lease_mnth: Split the year and months part of the remaining_lease respectively using str_sub() function of stringr package then converting the character to integer using as.integer() function of base R package.
After performing mutate function, we will store the new data in rs_transform.
rs_transform <- rs_subset %>%
mutate(rs_subset, address = paste(block,street_name)) %>%
mutate(rs_subset, remaining_lease_yr = as.integer(str_sub(remaining_lease, 0, 2))) %>%
mutate(rs_subset, remaining_lease_mnth = as.integer(str_sub(remaining_lease, 9, 11)))After we have successfully added the three variables (address, remaining_lease_yr, and remaining_lease_mnth) into a new data named rs_transform, we will see some NA values in the remaining_lease_mnth column. Therefore, we will need to replace those with a value of 0 using is.na() function of base R package.
rs_transform$remaining_lease_mnth[is.na(rs_transform$remaining_lease_mnth)] <- 0
rs_transform# A tibble: 23,657 × 14
month town flat_…¹ block stree…² store…³ floor…⁴ flat_…⁵ lease…⁶ remai…⁷
<chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
1 2021-01 ANG MO… 4 ROOM 547 ANG MO… 04 TO … 92 New Ge… 1981 59 yea…
2 2021-01 ANG MO… 4 ROOM 414 ANG MO… 01 TO … 92 New Ge… 1979 57 yea…
3 2021-01 ANG MO… 4 ROOM 509 ANG MO… 01 TO … 91 New Ge… 1980 58 yea…
4 2021-01 ANG MO… 4 ROOM 467 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
5 2021-01 ANG MO… 4 ROOM 571 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
6 2021-01 ANG MO… 4 ROOM 134 ANG MO… 07 TO … 98 New Ge… 1978 56 yea…
7 2021-01 ANG MO… 4 ROOM 204 ANG MO… 07 TO … 92 New Ge… 1977 55 yea…
8 2021-01 ANG MO… 4 ROOM 429 ANG MO… 04 TO … 92 New Ge… 1978 56 yea…
9 2021-01 ANG MO… 4 ROOM 413 ANG MO… 10 TO … 92 New Ge… 1979 57 yea…
10 2021-01 ANG MO… 4 ROOM 415 ANG MO… 07 TO … 92 New Ge… 1979 57 yea…
# … with 23,647 more rows, 4 more variables: resale_price <dbl>, address <chr>,
# remaining_lease_yr <int>, remaining_lease_mnth <dbl>, and abbreviated
# variable names ¹flat_type, ²street_name, ³storey_range, ⁴floor_area_sqm,
# ⁵flat_model, ⁶lease_commence_date, ⁷remaining_lease
Now, as we scroll to the remaining_lease_mnth column, we noticed all initial “NA” values have been replaced by 0!
Next, we do not want to segregate the remaining lease in years and months columns. Instead, we could convert the remaining_lease_yr to months unit and create a new column call total_remaining_lease for easier analysis later using mutate function of dplyr package which contains the summation of the remaining_lease_yr and remaining_lease_mnth using rowSum() function of base R package.
Multiply remaining_lease_yr column in months unit
rs_transform$remaining_lease_yr <- rs_transform$remaining_lease_yr * 12Create a new column: total_remaining_lease to contain the summation of yr and mnth
rs_transform <- rs_transform %>%
mutate(rs_transform, total_remaining_lease = rowSums(rs_transform[, c("remaining_lease_yr", "remaining_lease_mnth")])) %>%
select(month, town, address, block, street_name, flat_type, storey_range, floor_area_sqm, flat_model,
lease_commence_date, total_remaining_lease, resale_price)Display head of data
head(rs_transform)# A tibble: 6 × 12
month town address block stree…¹ flat_…² store…³ floor…⁴ flat_…⁵ lease…⁶
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl> <chr> <dbl>
1 2021-01 ANG MO … 547 AN… 547 ANG MO… 4 ROOM 04 TO … 92 New Ge… 1981
2 2021-01 ANG MO … 414 AN… 414 ANG MO… 4 ROOM 01 TO … 92 New Ge… 1979
3 2021-01 ANG MO … 509 AN… 509 ANG MO… 4 ROOM 01 TO … 91 New Ge… 1980
4 2021-01 ANG MO … 467 AN… 467 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
5 2021-01 ANG MO … 571 AN… 571 ANG MO… 4 ROOM 07 TO … 92 New Ge… 1979
6 2021-01 ANG MO … 134 AN… 134 ANG MO… 4 ROOM 07 TO … 98 New Ge… 1978
# … with 2 more variables: total_remaining_lease <dbl>, resale_price <dbl>, and
# abbreviated variable names ¹street_name, ²flat_type, ³storey_range,
# ⁴floor_area_sqm, ⁵flat_model, ⁶lease_commence_date
Upon inspection of the rs_transform, we now only left with one column: total_remaining_lease that contains all the remaining lease in months!
4.1.3 Retrieve Postal Codes and Coordinates of Addresses
In this section, we will focus on retrieving the relevant data like postal codes and coordinates of the address which is required to get the proximity to locational factors in the later parts.
Here are the steps to add its longitude and latitude features with OneMapSG API!
Step 1: Create a list storing unique addresses
add_list <- sort(unique(rs_transform$address))Step 2: Create function to retrieve coordinates from OneMapSG API
(Add explanation of code here)
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
r <- GET('https://developers.onemap.sg/commonapi/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
# Send a GET request to OneMap API with address as searchVal,
# returnGeom as 'Y' to retrieve the coordinates, and getAddrDetails as 'Y' to retrieve the postal code
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Extract the 'found' and 'results' fields from the API reponses
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i, postal = postal, latitude = lat, longitude = lng)
}
}
else {
new_row <- data.frame(address= i, postal = NA, latitude = NA, longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}Step 3: Call get_coords function to retrieve resale coordinates
coords <- get_coords(add_list)coords$postal[is.na(coords$postal)]character(0)
coords$latitude[is.na(coords$latitude)]character(0)
coords$longitude[is.na(coords$longitude)]character(0)
Step 4: Combine resale and coordinates data
rs_coords <- left_join(rs_transform, coords, by = c('address' = 'address'))Step 5: Write file to rds
Now that out subset resale dataset is completed with the coordinates, we can now save it as an rds file!
rs_coords_rds <- write_rds(rs_coords, "data/aspatial/rds/rs_coords.rds")rs_coords <- read_rds("data/aspatial/rds/rs_coords.rds")rs_coords_sf <- st_as_sf(rs_coords,
coords = c("longitude",
"latitude"),
crs=4326) %>%
st_transform(crs = 3414)length(which(st_is_valid(rs_coords_sf) == FALSE))[1] 0
tmap_mode("view")
tm_shape(rs_coords_sf)+
tm_dots(col="red", size = 0.02)